Abstract


The Fairchild CLIPPER microprocessor is a new high performance 
three chip module consisting of a microprocessor chip and two cache and 
memory management (CAMMU) chips, mounted on a small PC board. 
CLIPPER implements a new instruction set architecture which has been 
designed for high performance, convenient programmability, broad functionality
and sufficient architectural “openness” to permit future evolution
and a variety of implementations.


In this paper, we (a) describe the instruction set architecture of 
CLIPPER, (b) describe the chip design architecture and the interesting 
features of the implementation, and (c) consider in some detail the reasons 
for various design decisions and tradeoffs. Performance estimates are provided.
Possible future directions for both performance and instruction set
architecture are outlined. Some comments on the RISC vs. CISC issue are
given.


+CLIPPER is a trademark of Fairchild Semiconductor Corporation 
*Fairchild Semiconductor Corporation, 4001 Miranda Avenue, Palo Alto, Ca., 94304.
cy (DoD), under Arpa Order No. 4871, Monitored by Naval Electronic Systems Command under Contract No.
N00039-84-C-0089. Some research results obtained under this funding are presented in this paper.


1. Introduction 


1.1. Summary of Features 


The Fairchild CLIPPER+ employs a new high performance computer architecture
implemented initially as a three chip module, consisting of a processor chip
and two cache and memory management unit (CAMMU) chips (see figure 1); the 
processor is also available separately. It uses a new instruction set which is 
“simplified” and “RISC-like” but not RISC. The machine has a 32-bit architec- 
ture, with a 32-bit bus data path, 32-bit registers, 32-bit data paths on chip and a 
separate 32-bit virtual address space for the system and for each user address 
space. There are nine addressing modes, permitting memory addresses to be com- 
puted from most of the useful combinations of the program counter, register con- 
tents and/or a displacement of 12, 16 or 32 bits. Instructions are 2, 4, 6 or 8 bytes 
long, with their length, address mode, and opcode specified in the first two bytes 
for efficient decoding. Data types include bytes, halfwords, words (32 bits), long- 
words (8 bytes), and single (4 bytes) and double (8 bytes) precision floating point. 
Three user visible register sets are available: 16 user and 16 supervisor general 
purpose 32-bit registers, and 8 floating point registers of 64 bits each. There are 
also the usual control registers (program counter, program status word, system 
status word) and some internal registers used by the processor. Eighteen traps are 
implemented and there is provision for 128 system calls. Floating point operations 
conform to the IEEE 754 standard [Cody84]. 


The CLIPPER microprocessor has been designed with virtual memory as the 
standard mode of operation. The associated CAMMU chips each contain a 4Kbyte 
cache, a translation lookaside buffer (TLB) and a translator. One CAMMU is used 
for instruction references and the other for data; the CAMMUs not only provide 
caching but also implement protection, detect page faults, and watch the system 
bus in order to ensure multiple cache consistency. A full 32-bit address space is 
provided for the operating system and for each user process; the address space is 
not partitioned via high order address bits. 


The floating point unit is on the CLIPPER processor chip. Instruction execu- 
tion is pipelined with up to five instructions in the pipeline. Interlocks and depen- 
dency checks are provided in the pipeline hardware, so that no compiler inserted 
no-ops are needed for correct operation. A few complicated operations and diag- 
nostics are implemented as instruction sequences in a small, on-chip ROM, called 
the Macro Instruction ROM (MIROM); all other instructions are hardwired. No 




+The trademark “CLIPPER” was chosen in a reflection of the preference of the principal architect and program manager for 
spending his weekends sailing. 
microcode is used. The machine has 168 instructions, of which 101 are directly
hardwired.


The processor chip is implemented in 2 micron CMOS, is 156K square mils 
and uses 132,000 transistors. Performance estimates show that the current implementation
is somewhat faster than a VAX 8600, which is itself generally referred
to as a “4-MIPS” machine; CLIPPER is thus a 5 MIPS computer. The peak execution
rate in CLIPPER instructions is 33 MIPS. Additional information on
CLIPPER is available in [Fair86, Cho86]. 


1.2. Motivation and Design Philosophy 


The decision to design and build CLIPPER was made in the belief that there 
existed and exists the need for a very high performance computer based on a 
microprocessor chip. The immediate applications for such a processor are in high 
performance workstations and for use in “super-minicomputer” shared machines. 
To introduce some historical perspective, the highest performance commercial 
mainframe in 1976 was the IBM 370/168, which for the kind of workloads expected 
on CLIPPER (C, Fortran, Pascal), had performance comparable to that of 
CLIPPER. 


It is the belief of the CLIPPER designers that no existing commercial computer
architecture in 1982-3 met the requirements of: (a) permitting a high performance
implementation (b) on a microprocessor chip (c) with the necessary instruction
set and architectural features. Architectures then available on microprocessors
failed to permit high performance implementations, and most other architectures
either failed to be easily implementable on a chip or failed to provide a reasonable
range of features. There were also commercial barriers to the use of an
existing architecture. The decision was thus made to design a new instruction set
architecture, using the previous experience of the designers and the latest thinking
in the computer architecture research community.


Fashions in computer architecture have varied widely over the last few years, 
changing from the baroque or rococo in the 1970s to the minimalist 1980s. It was
widely believed in the 1970s that hardware would be very cheap, software was
difficult and expensive, and that therefore as much functionality as possible should 
be moved to the hardware. The result was complex architectures such as the DEC 
VAX [DEC81, Levy80]. The problems with such a complex architecture are that it 
is very difficult to obtain good performance as a function of the amount of logic 
needed, and the machine is hard (time consuming, expensive) to design, build, and 
debug [Henn82,84]. 


The popular thinking in computer architecture shifted in the 1980s toward 
very simple architectures, as originally implemented in the Cray designed 
machines (CDC 6400, 6600, 7600), studied and implemented in the IBM 801 


Hollingsworth, Sachs, Smith -3- CLIPPER Processor 


[Radi81] and further studied and popularized by the RISC project at Berkeley
[Patt85] and the MIPS project at Stanford [Henn84]. Such machines permit high
performance implementations and rapid design and development but are less than 
ideal in terms of programmability; one becomes very dependent on sophisticated 
software technology to obtain good performance and guarantee correct operation. 
There is a saying attributed to Einstein [Lamp83], to the effect that “everything 
should be made as simple as possible, but no simpler.” Our feeling was that the 
“pure RISC” type architectures provided insufficient features and functionality for 
a commercial product, and that equivalent performance advantages were also 
available in a carefully designed architecture of “moderate” complexity. Some 
discussion of the RISC/CISC issue appears in [Colw85]. 


The choice was thus made to design a new instruction set architecture (ISA). 
The instructions, the module design, and the functional partitioning were chosen 
to permit mainframe level performance, and to permit future compatible main- 
frame implementations. The continuing and increasing adoption of the easily 
ported UNIX* [Rite74] as the standard operating system for academic, software
development and workstation environments made the decision to use a new ISA 
commercially feasible. 


1.3. Outline and Context 


It is possible to describe a “computer” at many levels. The instruction set 
architecture (ISA) refers to the computer instruction set as expressed in binary or 
in assembly language and its function; the ISA is usually described in the Princi- 
ples of Operation manual. We use the term design architecture to refer to the 
highest level of description of an implementation, i.e., the block diagram and 
parameter level. Below that are gate and circuit level descriptions. 


This paper is primarily directed at the instruction set architecture of 
CLIPPER, with some of the material concerned with the design architecture and 
related issues such as performance, design tradeoffs, design implications and areas 
for possible future expansion. 


A brief summary of the memory architecture of CLIPPER is provided in the 
next section. Registers and modes of operations are discussed in section 3, instruc- 
tion formats and addressing modes in section 4, and the instruction set itself in 
section 5. Interrupts and traps are considered in section 6. Section 7 presents the 
essentials of the design architecture and implementation. Performance is dis- 
cussed in section 8 and design tradeoffs and possible extensions are reviewed in 
section 9. 


*UNIX is a trademark of AT&T Bell Laboratories 
2. Memory Architecture and Data Types


2.1. Memory Architecture 


In this section, we provide a brief overview of the memory architecture of the 
CLIPPER microprocessor. A much more detailed description, including a discussion
of the CAMMU, is provided in [Cho86], and that paper should be read as the
companion to this one. 


In normal operation, CLIPPER uses virtual memory, although unmapped 
(real memory) mode is also possible. The supervisor and each user process has its 
own 32-bit virtual address space, defined by the PDO (page directory origin) regis- 
ter in the CAMMU, which contains the physical memory address of the base of the 
first level of the page map for the process. The page map is implemented in two 
levels, with the first level referred to as the page directory and the second level 
containing the page tables. The page size is 4Kbytes, which is large enough for 
efficient I/O [Smit81], keeps the TLB miss ratio down and provides enough 
unmapped bits that set selection in the 4Kbyte caches can be effectively over- 
lapped with translation [Smit82]. The page size is also small enough to avoid 
unreasonable levels of internal fragmentation. No address bits are used to parti- 
tion the address space, as is done in the VAX and MIPS machines [Demo86], so 
such a partitioning doesn’t constitute an obstacle to increased address space size as 
technology evolves. 


Two cache and memory management chips (see figure 1) provide most of the 
support for the memory architecture; one is used for data and the other for instruc- 
tions, each connected to the processor by its own 32-bit bus. Each CAMMU has a 
TLB (translation lookaside buffer) and a translator. The TLB is set associative 
with 128 entries organized as 64 sets of two elements each. Protection is provided 
on a page basis, with each page table entry specifying permission for the process to 
read, write and/or execute from the page in supervisor and/or user state; protection 
bits are cached in the TLB. Page faults, protection faults, and memory errors are 
detected by the CAMMU and a trap code is returned to the processor for supervi- 
sor action. 


Each CAMMU also contains a 4Kbyte cache memory, organized as 128 sets of 
two 16-byte lines. The caching policy (copy back, write through, uncacheable) is 
defined on a per page basis and can vary from page to page; caching policy bits 
are attached to each page table and TLB entry. The CAMMU is capable of 
“watching” the system bus and acting to maintain cache consistency when there 
are multiple CPUs on the bus and/or when I/O operations reference data resident 
in the local cache. 
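
As an illustration of the geometry described above, the following Python sketch splits a 32-bit virtual address into the fields used for cache and TLB lookup. The cache and TLB sizes come from the text; the function name and the choice of the low bits of the virtual page number as the TLB set index are assumptions made for this sketch, not details taken from the CAMMU documentation.

```python
# Sizes taken from the text: 4 Kbyte pages, a 4 Kbyte cache organized as
# 128 sets of two 16-byte lines, and a 128-entry TLB organized as 64 sets of two.
PAGE_SIZE = 4096
LINE_SIZE = 16
CACHE_SETS = 128
CACHE_WAYS = 2
TLB_SETS = 64
TLB_WAYS = 2

PAGE_OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 untranslated bits
LINE_OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 4 bits select a byte in a line
CACHE_INDEX_BITS = CACHE_SETS.bit_length() - 1  # 7 bits select a cache set


def split_address(vaddr):
    """Split a 32-bit virtual address into lookup fields (hypothetical helper)."""
    vaddr &= 0xFFFFFFFF
    vpn = vaddr >> PAGE_OFFSET_BITS
    return {
        "vpn": vpn,
        "page_offset": vaddr & (PAGE_SIZE - 1),
        "byte_in_line": vaddr & (LINE_SIZE - 1),
        "cache_set": (vaddr >> LINE_OFFSET_BITS) & (CACHE_SETS - 1),
        # Assumption: the TLB set is selected by the low bits of the page number.
        "tlb_set": vpn & (TLB_SETS - 1),
    }


# The cache set index needs only 4 + 7 = 11 address bits, all of which lie
# inside the 12-bit page offset, so set selection does not have to wait for
# address translation.
overlap_possible = LINE_OFFSET_BITS + CACHE_INDEX_BITS <= PAGE_OFFSET_BITS
```

Under these assumptions, every bit needed to pick a cache set is available before the TLB has produced a real page number, which is the overlap referred to in the discussion of page size.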


The low order eight pages of the supervisor address space are permanently mapped
by the CAMMU to provide access to Boot ROM (residing on the system bus), I/O,
which is addressed via reads and writes to memory addresses, and low main
memory. Trap and interrupt vectors reside in low memory. The CAMMUs are
controlled by reads and writes to the I/O region of memory.


Originally, CLIPPER was designed to use a consistent “little endian”
[Kirr83] numbering system for bits, bytes and words, in which the most
significant bit is the highest numbered bit of the highest numbered byte;
internally, CLIPPER remains little endian. Figure 2 shows the
instruction formats, in which the bit, byte and word numbering may be observed. 
The “first parcel” is the first two bytes of the instruction stream; the remaining 
bytes of the instruction or the bytes of the following instruction(s) will appear in 
the second, third, and fourth parcels. This numbering system is the same as is 
used in the DEC VAX and National 32000 [Hunt84]. This contrasts with the Sys- 
tem/370 [IBM76] in which the most significant bit is the lowest numbered bit of 
the lowest numbered byte; bits, bytes and words are numbered in increasing order 
from left to right, with the MSB at the left. The Motorola 68000 also uses a “big 
endian” scheme, but numbers bits in the opposite order from bytes and words 
[Moto82]. 


CLIPPER has been enhanced so that in its current version, it can function in
either little-endian or big-endian mode. The appropriate byte order is selectable
at power-up time by tying a pin to either +5 V or ground. When operating in big-endian
mode, CLIPPER does the following: (a) reverses the order of half-words in
the instruction buffer, (b) reverses the order in which double word operands are 
loaded/stored, (c) changes the byte and half word addressing so as to reference the 
correct byte or half word within a word. When operating in big-endian mode, 
CLIPPER can function effectively in a system with big endian processors and/or 
data files created by big-endian machines. 
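
Item (c) above can be illustrated with a small Python sketch. It assumes that the word is held internally in little-endian byte lanes and that big-endian addressing is obtained by inverting the low address bits (XOR with 3 for bytes, with 2 for half words). This is a common technique chosen for illustration only; the text does not state how CLIPPER implements the remapping, and the function names are hypothetical.

```python
def byte_lane(addr, big_endian):
    """Return the internal (little-endian) byte lane for a byte address."""
    # Assumption: big-endian mode inverts the two low address bits.
    return (addr & 3) ^ (3 if big_endian else 0)


def halfword_lane(addr, big_endian):
    """Return the internal byte lane at which an aligned half word starts."""
    if addr & 1:
        raise ValueError("half words must be aligned on a 2-byte boundary")
    # Assumption: big-endian mode inverts address bit 1 for half words.
    return (addr & 2) ^ (2 if big_endian else 0)


def load_byte(word, addr, big_endian):
    """Extract the addressed byte from a 32-bit word value."""
    return (word >> (8 * byte_lane(addr, big_endian))) & 0xFF


def load_halfword(word, addr, big_endian):
    """Extract the addressed half word from a 32-bit word value."""
    return (word >> (8 * halfword_lane(addr, big_endian))) & 0xFFFF


word = 0x11223344
little_first_byte = load_byte(word, 0, big_endian=False)  # least significant byte
big_first_byte = load_byte(word, 0, big_endian=True)      # most significant byte
```

With this mapping the word value itself is unchanged between modes; only the byte or half word selected by a given address differs.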


2.2. Data Types 


The selection of data types represents a compromise between apparent func- 
tionality, which is enhanced by a large number of data types, and implementabil- 
ity, which is easiest when the number of types is small. The data types supported 
by the CLIPPER architecture include signed and unsigned bytes, half words (2 
bytes), words (4 bytes) and long words (8 bytes). There are also single and double 
precision (4 and 8 bytes respectively) floating point numbers. This set of data 
types is sufficient to implement programming languages such as C, Fortran and 
Pascal, with direct hardware support provided for most language operations. (Ini- 
tially, as suggested in [Henn82], little support for bytes or half words was 
intended, but further examination of programming needs showed that more direct 
hardware support was required.) 
We note that CLIPPER does not (at this time) provide decimal numbers, strings,
or precision beyond that of long words or double precision floating point as
hardware specified data types. Strings can be easily implemented via software; in
addition, CLIPPER provides three string manipulation instructions (move, com- 
pare, fill) as MIROM sequences. Extended precision can be obtained via software 
when needed. 


CLIPPER also imposes alignment restrictions on data items. Each data item
must be stored on a boundary which is a multiple of its size [Neff86a]. This restriction
generally causes little difficulty, and considerably simplifies the processor
implementation. For CLIPPER, there is therefore no implementation problem with line
crossers (fetch or store requests spanning a pair of cache lines) or page crossers
(fetch or store requests spanning a page boundary).
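
The connection between alignment and the absence of line and page crossers can be checked with a short Python sketch. The line and page sizes are those given in section 2.1; the helper names are hypothetical.

```python
LINE_SIZE = 16    # bytes per cache line (section 2.1)
PAGE_SIZE = 4096  # bytes per page (section 2.1)


def is_aligned(addr, size):
    """True if addr is a multiple of the item size."""
    return addr % size == 0


def crosses(addr, size, block):
    """True if the item [addr, addr + size) spans two blocks of the given size."""
    return addr // block != (addr + size - 1) // block


# Every naturally aligned item of 1, 2, 4 or 8 bytes stays inside one cache
# line and therefore inside one page.
aligned_items_never_cross = all(
    not crosses(addr, size, LINE_SIZE) and not crosses(addr, size, PAGE_SIZE)
    for size in (1, 2, 4, 8)
    for addr in range(0, 2 * PAGE_SIZE, size)
)
```

The argument holds because each data size divides the line size, and the line size divides the page size; an unaligned word at address 14, by contrast, would span two lines.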


3. Registers and Modes of Operation 


3.1. User and Supervisor General Purpose Registers 


There are two sets of 16 general purpose registers (GPRs), one referenced by 
user mode programs and one by supervisor mode programs. The mode of the pro- 
gram is determined by a bit in the SSW. There are two privileged instructions 
that allow data transfers between user and supervisor registers. 


The use of separate user and supervisor register sets speeds up interrupt and 
trap handling. The selection of 16 registers was determined by several factors, 
including the number of bits conveniently available for register addressing and the 
fact that 16 registers represent a good tradeoff; 16 registers are enough for local 
working storage without inducing unreasonable overhead for saving and restoring 
them at procedure call time. The C compiler provided by Fairchild [Neff86a] saves 
and restores only those registers that have been modified. For comparison, we 
note that both the VAX and the IBM 370 have 16 GPRs. Lunde’s results [Lund74] 
suggest that 8-10 registers are almost always sufficient. 


3.2. Floating Point Registers 


CLIPPER provides a set of eight double precision floating point registers 
accessible in both user and supervisor states; floating point instructions refer to 
these. This is similar to the IBM 370 design; in that machine there are four FP 
registers. Eight registers seem to provide sufficient storage for temporary 
operands, whereas four seem insufficient in the absence of memory to register 
operations other than load and store. (For non-numerically intensive programs, 
Lunde found that three floating point registers were usually sufficient. We expect 
a workload that is more numerically intensive than that analyzed by Lunde.) 
3.3. Processor Status Registers


Three additional program addressable registers are provided, the program 
counter (PC), the program status word (PSW) and the system status word (SSW). 
The program counter contains the address of the instruction about to be issued, 
i.e. the instruction in the pipeline that will be released and allowed to modify the 
processor state (write into a register or store a result). Not user addressable are 
the internal registers containing addresses of instructions following the currently 
issued instruction in the pipe. 


The program status word (PSW) is primarily used to hold status informa- 
tion (condition codes, trap codes) and to set those aspects of the processor state 
that the user process is permitted to modify, such as floating point trap enables. 
Four bits of condition code are provided (negative, zero, overflow, carry), and five 
bits of floating point exception status, as required by IEEE Std. 754, are also avail- 
able. Six bits are used to enable/disable floating point traps, and two more to 
specify the floating point rounding mode. A trace trap bit is available. Four bits 
are used to record program traps (e.g. trace trap, illegal operation), and four more 
to record system trap types (memory error, page fault, etc.). The PSW may be 
read or written by the user process. 


The last status register is the system status word (SSW). The SSW is used, 
among other things, to record the interrupt number and level, to enable inter- 
rupts, to set the mode (user/supervisor) and to set the protection key. The SSW 
may only be written in supervisor state. Its use is further described in [Cho86]. 


4. Instruction Formats and Addressing Modes 


4.1. Addressing Modes 


The CLIPPER microprocessor uses primarily a load/store architecture, i.e. 
most of the references to memory are via load and store instructions. This is in
contrast to both the IBM 370 and DEC VAX which make extensive use of their
register/memory operations (370 RX type instructions) and their memory to 
memory (370 SS type) instructions. The elimination of most RX and SS instruc- 
tions substantially simplifies the processor implementation by eliminating control 
logic and especially by simplifying recovery from traps and interrupts such as page 
faults and memory errors. Without the RX and SS type instructions, we expect 
that CLIPPER code will be slightly larger in size than that for the IBM 370 and 
VAX, but since all operands must always be specified, the increase should be 
small. Because we have variable length instructions with a variety of addressing 
modes, CLIPPER experiences a much smaller increase in code volume than the 
RISC [Patt85] and MIPS [Mous86] processors, both of which use only fixed length
32-bit instructions. For RISC-I [Patt81], a 2/3 increase in number of instructions
over the VAX was observed, using a very primitive compiler for RISC.


For load and store instructions, CLIPPER provides nine addressing modes, 
which appear in figure 2. These nine address modes represent those judged to be 
important for convenient programming plus those that come for “free;” i.e. can be 
trivially generated given the logic and data paths already available. For a 32-bit 
architecture, a register + 32-bit displacement mode (relative with 32-bit displace- 
ment) is very useful. The long 32-bit displacement eliminates the aggravating 
addressability problem posed by the 12-bit displacement of the IBM 370. The 
register + 12-bit displacement mode saves 4 bytes, if only a short displacement is
needed, and the relative (register with no displacement) mode requires two bytes 
less. Register + displacement addressing is often used for array and stack refer- 
ences, and local variables. 


Absolute addressing is provided with 16-bit or 32-bit address constants. Abso- 
lute addressing is typically used for references (e.g. calls) to independently com- 
piled code segments, and in the 16-bit form, for references to low memory and 
within small programs. 


It has been observed [Peut77] that a PC-relative address mode would have 
been very useful in the IBM 370, and such modes are provided by CLIPPER. The 
PC can be used with a 16 or 32 bit displacement or with a register (GPR) displace- 
ment. Most of the time, the short displacement should be sufficient; in [Peut77], 
99% of the branches were expressible in 16 bits or less as an offset from the PC. 
PC relative addressing is used primarily for branches and the PC+GPR mode for 
computed gotos and case statements. 


Finally, a two register address mode (relative indexed) is provided, which 
facilitates addressing when both the base and index addresses are in registers, as 
when an array is passed as a parameter. 
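
The nine modes described above can be summarized as an effective address calculation. The Python sketch below is illustrative only: the mode names are labels invented here, and the sign extension of the 12-bit and 16-bit fields (including the 16-bit absolute address) is an assumption, since the text gives only the field widths.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def effective_address(mode, pc=0, r1=0, rx=0, disp=0):
    """Compute a 32-bit effective address for one of the nine modes.

    r1 is the base register value, rx the index register value, and disp
    the raw displacement or address field from the instruction.
    """
    if mode == "relative":                 # (R)
        ea = r1
    elif mode == "relative_disp12":        # (R) + 12-bit displacement
        ea = r1 + sign_extend(disp, 12)
    elif mode == "relative_disp32":        # (R) + 32-bit displacement
        ea = r1 + disp
    elif mode == "absolute16":             # 16-bit address constant
        ea = sign_extend(disp, 16)
    elif mode == "absolute32":             # 32-bit address constant
        ea = disp
    elif mode == "pc_disp16":              # (PC) + 16-bit displacement
        ea = pc + sign_extend(disp, 16)
    elif mode == "pc_disp32":              # (PC) + 32-bit displacement
        ea = pc + disp
    elif mode == "pc_indexed":             # (PC) + GPR
        ea = pc + rx
    elif mode == "relative_indexed":       # base register + index register
        ea = r1 + rx
    else:
        raise ValueError("unknown addressing mode: " + mode)
    return ea & MASK32
```

Each mode needs at most one addition of two operands, which is consistent with the observation that several of the modes come for free once the adder and data paths exist.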


It is important to note four aspects of the way the address mode is specified, as 
shown in figure 2. First, the address mode is always defined in the first instruc- 
tion parcel (first two bytes), so there is no (slow) sequential decoding of the 
instruction; subsequent bytes can be immediately routed (as to the adder) without 
further examination. This encoding provides many of the supposed advantages of
fixed length instructions such as are used in RISC and MIPS. Second, 4 bits are 
used to specify the addressing mode, and only 8 of the 16 possible combinations 
are currently assigned; the remainder are available for future extensions. Third, 
there is no indirect addressing mode, a mode which is very difficult to implement 
efficiently. Finally, we note that some of the address modes result in unused bits 
in some fields, which could be used in the future to generate more than 32 bits of 
virtual address. 


To estimate the frequency of use of the various addressing modes, we note 
data from the literature. In [Peut77], it was found that for the workload examined 
there, addressing calculations for System/370 RX type instructions used no register
1.1% of the time, one register 85.6% of the time, and two registers 13.3% of the
time; the RX type instruction forms an effective address as the sum of a 12-bit dis- 
placement and the contents of up to two registers. Data in [Emer84] indicates 
that for the VAX, 61% of the operand addresses were displacement +register, and 
23% were just register. Displacements from a register were most often one byte 
long. For the PDP-11 [Neuh80], most of the operand addresses were specified in a 
register (with or without increment or decrement), and most of the remainder were 
displacement + register. Based on the data cited and further data in [Groc86] and 
[Wiec82], we expect the relative {(R)}, relative with 12-bit displacement {(R)+disp}, 
and the PC Relative with 16-bit Displacement {(PC) +disp} to account for the bulk 
of the address mode use. 


4.2. Instruction Formats


Figure 2 shows the available instruction formats. Those instructions using 
addresses have been discussed above; here we comment on instructions which do 
not contain memory addresses. 


Register to register instructions are specified in two bytes. Register-immediate
operations can be specified in 2, 4 or 6 bytes, depending on the size of the
immediate constant. Immediate constants are often small; in [Henn82], it is
reported that 69% of the immediate operands can be encoded in 4 or fewer bits and
96% in 8 or fewer bits; the corresponding figures from [Groc86] are 60% and 70%.
The availability of the quick format (which provides a 4-bit unsigned constant) 
and the 16-bit immediate format should greatly aid code density. 
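
The effect of the three immediate formats on instruction length can be sketched as follows. The lengths (2, 4 and 6 bytes) and the unsigned 4-bit quick constant come from the text; treating the 16-bit immediate as a signed value, and the format labels themselves, are assumptions made for this illustration.

```python
def immediate_format(value):
    """Pick the shortest register-immediate format able to hold value.

    Returns a (format label, instruction length in bytes) pair.
    """
    if 0 <= value <= 15:
        return ("quick", 2)       # 4-bit unsigned constant
    if -(1 << 15) <= value < (1 << 15):
        return ("imm16", 4)       # assumption: 16-bit immediate is signed
    if -(1 << 31) <= value <= 0xFFFFFFFF:
        return ("imm32", 6)       # full 32-bit immediate
    raise ValueError("constant does not fit in 32 bits")


def code_bytes(constants):
    """Total bytes used by register-immediate instructions for the constants."""
    return sum(immediate_format(c)[1] for c in constants)
```

For example, a sequence using the constants 1, 4, 100 and 70000 would occupy 2 + 2 + 4 + 6 = 14 bytes under these assumptions, against 24 bytes if every immediate required the 6-byte form.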


The control opcode is used when the operation requires a small (8-bit) con- 
stant only, as for the calls (system call) instruction. The macro opcodes are those 
used to invoke operations implemented via instruction sequences in the on-chip 
ROM, such as the string move (movc) instruction.


5. Instruction Set 


The CLIPPER instruction set is fairly conventional and reflects the experience 
of the designers with respect to two factors: what is needed for convenient and 
efficient programmability, and what can be easily implemented in hardware. 
Table 1 shows the set of opcodes, grouped in such a way as to minimize the redun- 
dant listing of the same opcode for various data types. Most of the entries there 
are self explanatory, and in this section we discuss only those operations that are 
either interesting or worth explaining. 
5.1. Floating Point


The CLIPPER microprocessor is unusual in placing its floating point unit on 
the processor chip; the floating point execution unit is also used to compute the 
integer multiplication, division and mod operations. Floating point arithmetic 
operations are performed as specified in the IEEE 754 standard. As noted earlier, 
there is a separate set of 8 floating point registers, and all floating point operations
are register to register. The floating registers may be loaded or stored from/to
main memory, or from/to the general purpose registers. 


5.2. Branches and Condition Codes 


The approach chosen for CLIPPER for controlling program execution is that of 
condition codes, which are set by one instruction and read and used by a subse- 
quent instruction; this is similar to what is done on the IBM 370. The use of con- 
dition codes for branching yields better performance and less complexity than an 
instruction which both tests and branches. 


There are four standard condition codes: N (negative), Z (zero), V (overflow) 
and C (carry), which are set in the PSW after certain operations. There are five 
floating point exception signalling codes: FX (floating inexact), FU (floating 
underflow), FD (floating divide by zero), FV (floating overflow) and FI (floating 
invalid op). Compare instructions normally set the N and Z flags; since the com- 
pare is executed by performing a subtraction, it is also possible that V and C may 
be set. 
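
Since the compare is performed as a subtraction, the four condition codes can be derived as in the Python sketch below. The operand order (first minus second) and the use of C as a borrow indicator are assumptions of this sketch; the text states only that N and Z are normally set and that V and C may also be set.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def compare_flags(a, b):
    """Return the N, Z, V, C flags produced by the 32-bit subtraction a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "N": int(bool(result & SIGN_BIT)),            # result is negative
        "Z": int(result == 0),                        # operands are equal
        # Signed overflow: operands differ in sign and the result's sign
        # differs from that of the first operand.
        "V": int(bool((a ^ b) & (a ^ result) & SIGN_BIT)),
        # Assumption: C records a borrow, i.e. a < b as unsigned values.
        "C": int(a < b),
    }
```

A subsequent branch on condition would then test some combination of these four bits, for example Z alone for equality.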


There are two standard branch instructions. Branch on condition tests the 
NZVC PSW bits; the list of possibilities is shown in table 1. The branch on float- 
ing exception tests either for any exception or for a bad result (floating invalid, 
divide by zero, overflow). Branch instructions use the standard addressing modes, 
as defined in figure 2, where the R2 field holds the condition code field that 
specifies the type of branch. 


Implemented directly in the hardwired instruction set are the call and return 
(ret) instructions. The call instruction decrements the stack pointer (defined by 
the register in the R2 field), pushes the address of the next instruction onto the 
stack, and then loads the PC with the target address. Return reverses the process. 
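
The call and return sequence just described can be modelled in a few lines of Python. This is a sketch under stated assumptions: the stack pointer is assumed to move by 4 bytes (one 32-bit word), memory is modelled as a dictionary of words, and the class and function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF
WORD = 4  # assumption: the return address occupies one 32-bit word


class Machine:
    """Minimal processor state: PC, 16 general registers and word memory."""

    def __init__(self):
        self.pc = 0
        self.regs = [0] * 16
        self.memory = {}


def call(m, sp_reg, target, instr_len):
    """Decrement the stack pointer, push the return address, load the PC."""
    return_addr = (m.pc + instr_len) & MASK32   # address of the next instruction
    m.regs[sp_reg] = (m.regs[sp_reg] - WORD) & MASK32
    m.memory[m.regs[sp_reg]] = return_addr
    m.pc = target & MASK32


def ret(m, sp_reg):
    """Reverse the call: pop the return address into the PC."""
    m.pc = m.memory[m.regs[sp_reg]]
    m.regs[sp_reg] = (m.regs[sp_reg] + WORD) & MASK32
```

Because the stack pointer is named by a register field rather than fixed by the architecture, any general register can serve as the stack pointer in this model.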


5.3. Macro Instructions 


The CLIPPER processor chip includes a small ROM (known as the Macro
Instruction ROM), which holds various useful code sequences. Approximately half 
of the MIROM is devoted to diagnostic code, to be used for chip testing and sorting 
during manufacturing. The remainder implements complex operations which are 
often found as single (usually microcoded) instructions on CISC machines. Implementing
these functions as MIROM sequences increases code density and
readability, decreases instruction fetch penalties (misses, sequential fetch delays),
and uses less instruction cache space.


A Macro instruction actually represents a branch into the ROM; the instruc- 
tion fetch unit starts fetching instructions from the ROM at the address specified 
by the macro opcode. In this section, we briefly discuss those instructions imple- 
mented in the MIROM; the operation of the MIROM is described in more detail in 
section 7.2. 


Instructions to save and restore general registers (save registers (savewn), 
restore registers (restwn), save floating registers (savedn) and save user registers 
(saveur)) are implemented in the MIROM as a sequence of consecutive store (or 
load) operations, starting from a given register number and continuing through 
register 14. The floating point register saves and restores are implemented simi- 
larly. 


Three string (storage to storage) instructions are currently implemented in 
the MIROM. These are movc (copy a string of characters from/to nonoverlapping
fields), initc (initialize a string with the contents of a register - primarily used for 
clearing buffers), and cmpc (compare two character strings). These instructions 
may be interrupted and restarted. 


All of the conversion operations, and negate floating, scale by, and load float- 
ing status (see table 1) are implemented in the ROM. 


The return from interrupt (reti) instruction restores the processor state after 
trap or interrupt processing, and is discussed in more detail in section 6.1. The 
wait for interrupt (wait) instruction causes the processor to halt pending the 
arrival of an enabled interrupt. The interrupt routine then determines whether to 
continue execution. 


5.4. Test and Set 


The cost and performance advantages of multiple microprocessor computer 
systems sharing a common memory are currently quite compelling [Smit85]. The 
Test and Set (tsts) instruction is the instruction chosen for CLIPPER for the imple- 
mentation of locks to be used in multiprocessor and multiprocess synchronization. 
As a single, indivisible operation, it (a) loads the contents of a main memory loca- 
tion into a specified GPR, and (b) sets bit 31 of the given main memory word to 1. 
Indivisibility is achieved by (a) making the lock word non-cacheable, and (b) hold- 
ing the main memory bus for the entire operation (which is a read / modify / 
write). A processor may loop, continually testing the lock until it is released,
may use the wait instruction to sleep, or may task switch. Test and Set
is also used by the IBM 370 and the M68000; the VAX provides seven instructions 
for locking and synchronization, some of which are equivalent to test and set. 
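
As an illustration only, the behaviour described above can be modelled in a few lines of Python. The Memory class, the try_acquire and release helpers, and the convention that bit 31 is the most significant bit of the word are assumptions of this sketch, not part of the CLIPPER definition; a Python lock stands in for holding the memory bus during the read / modify / write.

```python
import threading

LOCK_BIT = 1 << 31  # assumption: bit 31 is the most significant bit of the word


class Memory:
    """Toy word-addressed main memory whose bus can be held (hypothetical model)."""

    def __init__(self):
        self.words = {}
        self._bus = threading.Lock()  # stands in for holding the memory bus

    def test_and_set(self, addr):
        # Read / modify / write performed while the bus is held, so no other
        # processor can slip in between the load and the store.
        with self._bus:
            old = self.words.get(addr, 0)
            self.words[addr] = old | LOCK_BIT
            return old  # the value delivered to the GPR

    def store(self, addr, value):
        with self._bus:
            self.words[addr] = value


def try_acquire(mem, addr):
    """True if the lock was free (bit 31 clear) before the test and set."""
    return (mem.test_and_set(addr) & LOCK_BIT) == 0


def release(mem, addr):
    """Free the lock by clearing the lock word."""
    mem.store(addr, 0)


mem = Memory()
first = try_acquire(mem, 0x100)   # lock was free, now held
second = try_acquire(mem, 0x100)  # already held, so this attempt fails
release(mem, 0x100)
third = try_acquire(mem, 0x100)   # free again after the release
```

A caller that fails to acquire the lock would then choose among the three strategies named above: retry in a loop, sleep, or switch tasks.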


5.5. Opcode Assignment


As shown earlier in figure 2, the high order byte of the first parcel of each 
instruction contains the instruction opcode. The assignment of bits to opcodes is 
shown in figure 3. 


The important observation to be made from figure 3 is that of the possible 256 
operation codes available from 8 bits, 85 instructions (including sets of instruc- 
tions) are defined, and 104 of the bit combinations are used. (Not shown in figure 3 
are some opcodes used to implement instructions which may be executed only from 
the MIROM.) That leaves over 140 possible opcodes for future expansion. In gen- 
eral, we have made a conscious effort to allow the CLIPPER architecture to evolve 
with user needs and technology trends, and reserving a significant number of 
opcodes is one part of that effort. 


6. Interrupts, Traps and Supervisor Calls 


The CLIPPER microprocessor provides for 402 exception conditions: 18 
hardware traps, 128 programmable supervisor calls and 256 vectored interrupts. 
The number of hardware traps can be expanded to 128 at some future time. 


A trap is an exception that relates to a condition of a single instruction, e.g., 
page fault, memory error, overflow, etc. Interrupts are events signalled by devices 
external to the CLIPPER module. 


6.1. Intrap and Return Sequences 


The recognition by the hardware of a trap or interrupt causes entry to a 
macro instruction sequence, INTRAP, which in noninterruptable mode performs a 
context switch to supervisor mode, stores the PC, PSW and SSW on the supervisor 
stack, and transfers control to the trap or interrupt handler through the Vector 
Table. The Vector Table is a table in low memory containing 2-word entries; each 
entry contains the address of the trap or interrupt handler and the new SSW. The 
reti (return from interrupt) sequence is a noninterruptable sequence which
restores the system to the correct user or supervisor environment. Interrupts and 
traps are prioritized, with logic within the processor giving service to the highest 
priority event. Traps are permitted during interrupt and trap handling but result 
in an unrecoverable fault; page fault traps must be avoided during fault handling. 
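
The entry and return sequences can be sketched as follows. This is a minimal model under stated assumptions: the order in which the PC, PSW and SSW are pushed, the Vector Table base of zero, and the way the previous mode is remembered are all choices made for the sketch, and the Cpu class and function names are hypothetical.

```python
from dataclasses import dataclass, field

WORD = 4  # bytes per 32-bit word


@dataclass
class Cpu:
    pc: int = 0
    psw: int = 0
    ssw: int = 0
    supervisor: bool = False
    stack: list = field(default_factory=list)   # supervisor stack
    memory: dict = field(default_factory=dict)  # byte address -> word value


def vector_entry_address(vector, table_base=0):
    # Each Vector Table entry is two words: handler address, then new SSW.
    return table_base + vector * 2 * WORD


def intrap(cpu, vector):
    """Enter the handler for `vector` (push order is an assumption)."""
    # The previous mode is kept in the frame purely as a modelling convenience.
    cpu.stack.append((cpu.pc, cpu.psw, cpu.ssw, cpu.supervisor))
    cpu.supervisor = True
    entry = vector_entry_address(vector)
    cpu.pc = cpu.memory[entry]          # first word: handler address
    cpu.ssw = cpu.memory[entry + WORD]  # second word: new SSW


def reti(cpu):
    """Restore the state saved by intrap."""
    cpu.pc, cpu.psw, cpu.ssw, cpu.supervisor = cpu.stack.pop()


cpu = Cpu(pc=0x1000, psw=0x1, ssw=0x2)
cpu.memory[vector_entry_address(3)] = 0x8000
cpu.memory[vector_entry_address(3) + WORD] = 0x40
intrap(cpu, 3)
in_handler = (cpu.pc, cpu.ssw, cpu.supervisor)
reti(cpu)
after_return = (cpu.pc, cpu.psw, cpu.ssw, cpu.supervisor)
```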


6.2. Traps 


When a trap occurs, all instructions prior to the trapping instruction are com- 
pleted (including those in the floating point unit), and all instructions subsequent 
to the trapping instruction are flushed from the pipeline. 


It is possible to classify traps into several groups: data memory, floating point
arithmetic, integer arithmetic, instruction memory, illegal operation, diagnostics 
and supervisor calls. 


Data memory and instruction memory traps include correctable and uncorrect- 
able memory errors, page faults, and protection faults. In each case, the exception 
is recognized by the CAMMU which maintains in the TLB copies of the protection 
bits taken from the page table entries. 


The five floating point arithmetic traps are invalid operation, inexact result, 
overflow, underflow and divide by zero. There are trap enable flags for each of 
these in the PSW, and also exception flags in the PSW which are set when the 
corresponding events occur. There is an overall floating point trap enable flag 
(also in the PSW) which may be used to disable all floating point traps. 
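
The interaction of the three kinds of PSW bits can be summarized by a small predicate. The sketch below represents the PSW as a Python dictionary with hypothetical field names ("flags", "enables", "fp_trap_enable"); the actual bit layout is not given here and is not implied by the code.

```python
FP_EXCEPTIONS = ("invalid", "inexact", "overflow", "underflow", "divide_by_zero")


def fp_event(psw, event):
    """Record a floating point event; return True if a trap should be taken.

    psw is a dict with three hypothetical fields:
      "flags"          - set of exception flags raised so far
      "enables"        - set of individually enabled traps
      "fp_trap_enable" - the overall floating point trap enable
    """
    if event not in FP_EXCEPTIONS:
        raise ValueError(f"unknown floating point exception: {event}")
    psw["flags"].add(event)  # the flag is set whenever the event occurs
    # A trap needs both the overall enable and the individual enable.
    return psw["fp_trap_enable"] and event in psw["enables"]


psw = {"flags": set(), "enables": {"overflow"}, "fp_trap_enable": True}
trap_on_overflow = fp_event(psw, "overflow")   # individually enabled
trap_on_inexact = fp_event(psw, "inexact")     # flag set, but no trap
psw["fp_trap_enable"] = False
trap_when_disabled = fp_event(psw, "overflow")  # overall enable is off
```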


The trace trap causes a trap at the end of the current instruction. An MIROM 
sequence is considered to be a single instruction for tracing purposes. Tracing is 
disabled on entry to the INTRAP sequence and trace trap handler. 


Supervisor calls are implemented as traps triggered by the calls instruction. 
There are potentially 128 supervisor call codes; the CLIX* system (the Fairchild
port of Unix) [Neff86b] uses approximately 60 of them. 


6.3. Interrupts 


Interrupts are signalled externally to the processor and appear as signals on 
the interrupt pins of the system bus. An interrupt is taken only when: (a) no 
traps are pending except the trace trap, (b) interrupts are enabled, (c) all instruc- 
tions currently in the pipeline have completed, and (d) string instructions have 
either completed or have saved sufficient state to be able to restart. (Long string 
instructions will periodically test for pending interrupts, and if there are any, will 
save their state and permit the interrupt to be processed.) With the exception of 
the string instructions, interrupts are not accepted during MIROM sequences. 
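
Conditions (a) through (d), together with the rule for MIROM sequences, amount to a single predicate over the processor state. The following sketch spells it out; the CpuState fields are hypothetical names chosen for the illustration and do not correspond to documented signals.

```python
from dataclasses import dataclass


@dataclass
class CpuState:
    pending_traps: set           # names of traps awaiting service
    interrupts_enabled: bool
    pipeline_empty: bool         # all issued instructions have completed
    in_mirom_sequence: bool
    in_string_instruction: bool
    string_state_saved: bool     # the string instruction can be restarted


def can_take_interrupt(s):
    # (a) no traps pending other than the trace trap
    if s.pending_traps - {"trace"}:
        return False
    # (b) interrupts are enabled
    if not s.interrupts_enabled:
        return False
    # (c) everything in the pipeline has completed
    if not s.pipeline_empty:
        return False
    # (d) MIROM sequences block interrupts, except for string instructions
    #     that have saved enough state to be restarted
    if s.in_mirom_sequence:
        return s.in_string_instruction and s.string_state_saved
    return True


idle = CpuState(set(), True, True, False, False, False)
tracing = CpuState({"trace"}, True, True, False, False, False)
faulting = CpuState({"page_fault"}, True, True, False, False, False)
in_rom = CpuState(set(), True, True, True, False, False)
long_string = CpuState(set(), True, True, True, True, True)
```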


There are 16 prioritized interrupt levels, with 16 interrupts of equal priority 
within each level. Interrupt processing can be interrupted by an event of higher 
priority. 


7. Design Architecture 


As explained earlier, the term “design architecture” refers to the architectural 
implementation at a fairly high level. We discuss the design architecture of the 
CLIPPER CPU in this section. 


*CLIX is a trademark of Fairchild Semiconductor Corporation 


Figure 4 shows the major components of the CLIPPER processor and the
major interconnections in a simplified fashion. Somewhat more detail is shown in 
figure 5. As can be seen from those figures, the processor is divided into 6 princi- 
pal sections: the Instruction Bus Interface (including an instruction prefetch 
buffer), the Macro Instruction Unit, the Instruction Control Unit, the Floating 
Point Unit, the Integer Execution Unit, and the Data Bus Interface. We discuss 
each of these in this section. Table 2 shows the fraction of the chip area occupied 
by various processor sections; the remainder of the area (to the total of 100%) is 
occupied by empty space or other minor components. 


7.1. Instruction Bus Interface 


The instruction bus (described in more detail in [Cho86]) is a bi-directional 
46-line bus connecting the CPU chip to the Instruction CAMMU. The interface 
contains receivers (RCV) and drivers (DRV), and a 64-bit (8-byte) instruction 
buffer. Instructions are prefetched into this buffer, and are then fed into the 
instruction control unit as needed. A branch never hits in this buffer, as there is 
no mechanism to detect that a branch target address is within the buffer; on a suc- 
cessful branch, the instruction buffer is cleared. The Instruction CAMMU con- 
tains its own instruction counter, and will feed the next 4 bytes of the instruction 
stream into the instruction buffer every time the next instruction line of the 
instruction bus is pulsed. While within a cache line, the ICAMMU can deliver 4 
bytes every 2 CPU cycles (60ns), and the CPU can, at its maximum rate, execute
2 bytes (one parcel, or one 2-byte instruction) every CPU cycle (30ns).
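
The balance between instruction supply and peak demand follows directly from these figures. The short calculation below is a worked example only; the variable names are illustrative.

```python
CYCLE_NS = 30  # CPU cycle time in nanoseconds

supply_bytes_per_cycle = 4 / 2  # ICAMMU: 4 bytes every 2 CPU cycles
demand_bytes_per_cycle = 2 / 1  # CPU peak: one 2-byte parcel per cycle

# Sustained instruction bandwidth within a cache line, in megabytes per second.
supply_mb_per_s = supply_bytes_per_cycle / (CYCLE_NS * 1e-9) / 1e6

# Cycles of peak-rate execution held by a full 8-byte instruction buffer.
buffer_cycles = 8 / demand_bytes_per_cycle
```

Within a cache line, then, the ICAMMU supplies 2 bytes per cycle (about 67 megabytes per second), which just matches the peak consumption rate of one parcel per cycle; a full buffer holds four cycles of work at that rate.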


Also associated with the instruction bus interface is a multiplexor (MUX) 
which can accept instructions from either the instruction buffer or the Macro 
Instruction ROM and feed them to the instruction control unit. 


7.2. Macro Instruction Unit 


The Macro Instruction ROM (MIROM) is an on-chip ROM (1K entries x 47 
bits) which implements complicated instructions as sequences of simpler hardwired 
instructions; the opcode for the MIROM implemented instruction is effectively a 
branch target address into the ROM. Each entry in the MIROM contains two 
instruction parcels plus the next instruction address and a stop bit. 
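
The sequencing implied by this entry format can be sketched as a simple loop: start at the address given by the macro opcode, emit the two parcels of each entry, and follow the next-address field until an entry with the stop bit is reached. The MiromEntry class, the example ROM contents and the mnemonics are hypothetical placeholders, not actual CLIPPER microcode, and no field widths are implied.

```python
from dataclasses import dataclass


@dataclass
class MiromEntry:
    parcels: tuple   # two instruction parcels
    next_addr: int   # address of the next entry in the sequence
    stop: bool       # set on the last entry of a sequence


def run_macro(rom, opcode):
    """Yield the parcels of the sequence selected by a macro opcode.

    The opcode is used directly as the starting ROM address, following the
    description of it as effectively a branch target address.
    """
    addr = opcode
    while True:
        entry = rom[addr]
        yield from entry.parcels
        if entry.stop:
            return
        addr = entry.next_addr


# Hypothetical ROM contents; the mnemonics are placeholders only.
rom = {
    0x20: MiromEntry(("store r12", "store r13"), 0x21, False),
    0x21: MiromEntry(("store r14", "noop"), 0, True),
}
sequence = list(run_macro(rom, 0x20))
```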


The set of legal opcodes for ROM instructions is a superset of the standard 
instruction set, including, for example, the conditional branch within the MIROM 
itself; those ROM-only instructions are not shown in table 1 or figure 3. 

In addition to the regular registers, there are 16 scratch registers (12 regular 
and 4 floating point) accessible only from instructions in the MIROM. The 
instructions in the MIROM also have a mechanism to reference the registers 
specified by the R1 and R2 fields of the Macro instruction (see figure 2).


Figure 5. Processor Design Architecture


Figure 6. Pipeline Structure 


7.3. Integer Execution Unit


The integer execution unit contains the general register file (16 user GPRs, 16 
supervisor GPRs and 12 scratch registers), the shifter, and the ALU. The register 
file has three ports, permitting two reads and one write during the same machine 
cycle. 


The shifter implements the shift and rotate instructions and is designed as a 
serial double bit shifter. Single and double bit shifts occur in one cycle; larger 
shifts require multiple cycles. Data in [Huck83] shows that for his System/370 
workload, only 1.9% of all shifts were for more than 3 bits. 
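
A rough cycle count for the shifter can be written down under the assumption that shifts of more than two bits simply proceed at two bits per cycle; only the one-cycle figure for single and double bit shifts is stated above, and the function name is illustrative.

```python
def shift_cycles(count):
    """Cycles for a shift of `count` bits on a 2-bit-per-cycle serial shifter.

    Assumes larger shifts retire two bits per cycle; a shift always takes at
    least one cycle.
    """
    if count < 0:
        raise ValueError("shift count must be non-negative")
    return max(1, (count + 1) // 2)


examples = {n: shift_cycles(n) for n in (1, 2, 3, 4, 31)}
```

Under this assumption a 3-bit shift costs two cycles, so the shifts of at most 3 bits that dominate the cited workload would complete in one or two cycles.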


The ALU (arithmetic / logic unit) implements integer addition and subtrac- 
tion, bitwise logical operations, and register to register transfers. The address 
mode additions are also performed by the ALU; each requires only one pass 
through the ALU, since no address computation requires more than one add. 


7.4. Floating Point Unit 


CLIPPER is unusual among current microprocessors in having its floating 
point unit on chip. Multiplication uses a Booth algorithm [Cava84] which pro- 
duces products iteratively, two bits per clock cycle. Typically, one clock time is 
needed for round and one for normalize. Division uses a nonrestoring shift and 
subtract algorithm, producing 1 bit per clock. Associated with the FPU is the float- 
ing point register file, which contains eight regular and four scratch 64-bit floating 
point registers; the latter are accessible only from code running in the Macro 
Instruction ROM. The floating point unit is also used to perform integer multiply 
and divide. 
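
To make the "two bits per clock cycle" figure concrete, the following sketch shows a generic radix-4 Booth multiplication, which retires two multiplier bits per iteration. It operates on signed integers and is a textbook formulation chosen for illustration; it is not a description of the CLIPPER datapath, and the function name and default width are assumptions.

```python
def booth_radix4_multiply(x, y, bits=32):
    """Multiply two signed `bits`-wide integers, two multiplier bits per step.

    Returns (product, steps), where steps is the number of iterations.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("operand out of range")
    y_bits = y & ((1 << bits) - 1)  # two's complement image of the multiplier
    product = 0
    prev = 0   # implicit bit to the right of bit 0
    steps = 0
    for i in range(0, bits, 2):
        b0 = (y_bits >> i) & 1
        b1 = (y_bits >> (i + 1)) & 1
        digit = prev + b0 - 2 * b1      # Booth digit in {-2, -1, 0, 1, 2}
        product += (digit * x) << i     # add the weighted partial product
        prev = b1
        steps += 1
    return product, steps


small = booth_radix4_multiply(7, -3, bits=8)
large = booth_radix4_multiply(-123456, 654321)
```

An n-bit multiplier is consumed in n/2 iterations, which is the sense in which such a scheme produces the product two bits per clock.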


The floating point unit operates in parallel with respect to the rest of 
CLIPPER. Although only one floating point operation can be in execution at any 
one time, operations which neither use the FPU nor rely on its output can be 
issued steadily while the FPU completes the current operation. The result is that 
much of the execution time for floating point operations will overlap that of other 
instructions. 


Floating point exceptions may be out of sequence with respect to the rest of 
the instruction stream. When a floating point trap occurs, the address of the float- 
ing point instruction may be recovered from a special register; the PC value 
pushed on the system stack can be quite far from the address of the trapping 
instruction. 


7.5. Data Bus Interface 


The data bus interface consists principally of receiver and driver circuits for 
the data bus, and a shifter for aligning byte and half word operands. It is 
connected to all of the major functional units of the CPU via the S-bus (shown in
bold in figure 5), so that it can receive and deliver operands in the most expeditious
manner. 


7.6. Instruction Control Unit and CPU Pipeline 


The heart of the CLIPPER processor is the instruction control unit (ICU), 
which is responsible for decoding instructions and controlling instruction execu- 
tion. The ICU is shown in figure 5, and the reader should also note figure 6, 
which diagrams the operation of the instruction execution pipeline. 


In the ICU are several components. The program counter contains the 
address of the instruction about to be issued; to issue an instruction means to 
allow it to run to completion (i.e. modify registers or memory), provided no traps 
occur. Shown in figure 6 are two boxes, called the “B stage” and “C stage”. Each 
consists of a set of decoding logic and registers for holding partially decoded 
instructions and the corresponding instruction address. The B stage is responsible 
for instruction decoding and resource management; resource management keeps 
track of which functional units are busy and allows instructions to advance to the 
issue stage only if the necessary units are available. The C stage holds the fully 
decoded instruction, and controls the operation of the integer execution unit and 
the floating point unit. The J register (figure 5) is used to hold immediate values 
(including address offsets and address constants). Also located in the ICU are the 
PSW and SSW registers. 


There can be one instruction in each of the B and C stages. Shown preceding 
the B stage is the instruction buffer (IB) which holds 4 parcels (8 bytes) of instruc- 
tions, or up to four instructions. 


The last stage of the pipeline consists of parallel integer and floating point 
execution units. These two execution units can operate in parallel, with one active 
instruction in the FPU and one instruction in each of the three stages of the 
integer execution unit (IEU). Those three stages are operand fetch (L stage), 
arithmetic (A stage: ALU or shifter) and operand write (O stage - to either regis- 
ters or elsewhere). It takes three cycles for an instruction to pass through the IEU 
- one to read from the registers into the ALU, one to pass through the ALU or 
shifter, and one to write the results. There is a bypass from the output of the 
ALU to the input, so that results can be immediately reused in the next instruc- 
tion. 
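
The three-stage flow through the IEU can be tabulated with a short sketch. It assumes single-cycle instructions issued back to back with no stalls, and the function names are illustrative.

```python
STAGES = ("L", "A", "O")  # operand fetch, arithmetic, operand write


def ieu_schedule(n_instructions):
    """Cycle in which each instruction occupies each IEU stage (no stalls)."""
    return [
        {stage: issue + offset for offset, stage in enumerate(STAGES)}
        for issue in range(n_instructions)
    ]


def total_cycles(n_instructions):
    # The first result is written after 3 cycles; each later one a cycle after.
    return 0 if n_instructions == 0 else n_instructions + len(STAGES) - 1


schedule = ieu_schedule(2)
cycles_for_four = total_cycles(4)
```

In this schedule the second instruction is in its A stage during the same cycle in which the first is in its O stage. The bypass from the ALU output to its input is what lets the second instruction use the first one's result in that cycle, rather than waiting for the register write to finish.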


7.7. Layout, Area, and Physical Parameters 


A photograph of the CLIPPER CPU chip, on which functional areas are indi- 
cated, is shown in figure 7. Table 2 shows the fraction of the chip used for various 
purposes. The chip is implemented using 2-micron CMOS, with two levels of 
metal interconnect with a 6.5 micron pitch, one polysilicon level with 2.0 micron
gates and a 4.0 micron pitch, a 250 Angstroms thick gate oxide, and 2.0 micron
contacts and vias. Transistor switching speeds range from .5ns to 3.0ns, depending
on gate size and load. The chip dissipates 0.5 watts. The processor cycle time
is 30ns, which is also the minimum time to execute an instruction. The power 
supply is required to provide 0 and +5 volts. The processor chip has 132 pins. 
The overall chip size is 185,000 square mils; the package is .9 in. sq. and is sur- 
face mounted. 


8. Performance 


CLIPPER was conceived of and designed as a high performance processor, and 
as has been noted throughout this paper, design decisions and tradeoffs have been 
made whenever possible in the direction of higher performance. That high perfor- 
mance has indeed been achieved is evident from the instruction execution times 
shown in table 3. As can be seen, the minimum instruction execution time is one 
CPU cycle time, or 30 ns. The peak program execution rate is thus 33 MIPS. 


Benchmark timings have been obtained both via an instruction set timing 
simulator and from runs on a real machine using early versions of the various 
compilers. The simulator shows an average of 5-6 clock cycles per instruction 
including memory delays, or about 5-7 MIPS (measured in CLIPPER instructions). 
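
These figures follow from the cycle time and the average number of cycles per instruction, as the following worked calculation shows (the function name is illustrative).

```python
CYCLE_NS = 30.0  # CPU cycle time in nanoseconds


def mips(cycles_per_instruction, cycle_ns=CYCLE_NS):
    """Millions of instructions per second for a given average CPI."""
    return 1000.0 / (cycle_ns * cycles_per_instruction)


peak = mips(1)                 # one instruction per 30 ns cycle
typical = (mips(6), mips(5))   # 5-6 cycles per instruction
```

One instruction per cycle gives about 33.3 MIPS, and 5 to 6 cycles per instruction give roughly 5.6 to 6.7 MIPS, in line with the figures quoted above.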


Because the power of various instruction sets varies greatly, simply quoting a 
MIPS figure is not very meaningful. For that reason, various standard bench- 
marks have been run on a real CLIPPER. The CLIPPER system used had a rela- 
tively slow memory (2 wait states), and the compilers used have been available 
for periods ranging from less than a year (C) to less than a month (Fortran), so a high performance
CLIPPER “box”, using mature software, should do considerably better than the 
results presented here. The compiler version under current development (but not 
yet released) shows 10% to 20% better performance on the existing hardware. All 
of the results presented below were run on production hardware (during January, 
1987) at Fairchild, by the same people using the same code under the same condi- 
tions, and should be comparable and accurate. 


Table 4 shows the results of the Dhrystone [Weic84], Whetstone [Curn76],
Linpack [Dong83] and Berkeley [Hans82] benchmarks. It can be seen that
CLIPPER is about 25% faster than the VAX 8600 (VMS 4.1) on the Dhrystone 
benchmark and is about nine times the speed of a VAX 11/780. (Versions 1.0 and 
1.1 refer to the two versions of the Dhrystone benchmark.) For the Whetstone 
benchmark, CLIPPER is about 3 times the speed of the VAX 11/780 (with the 
floating point accelerator) and about 7.5 - 9.5 times as fast as the VAX 11/785 
(without an FPA) under Ultrix. Table 4 shows that CLIPPER is about 3.15 times 
as fast as the VAX 11/780 (VMS 3.7) on single precision Linpack and is about 3.11 
times as fast with double precision. (The VAX Linpack benchmarks were all run
using the standard VMS Fortran library. The “Fortran BLAS” results are with
the Fortran BLAS routines from Los Alamos. The “coded BLAS” shows the results
after further hand tweaking.) The Berkeley benchmarks show CLIPPER from 3.5 
to 13 times as fast as the VAX 11/780 under VMS 3.7. Using the VAX 11/780 as 
a canonical “1 MIPS” machine, CLIPPER is about a 5-6 MIPS machine. (In actual 
fact, the VAX has a CISC instruction set, and thus generally runs at about .5 
MIPS [Emer84]. The “canonical 1 MIPS” refers to a System/370 scientific work- 
load running on a System/370 instruction set machine.) 


8.1. Performance vs. Cycle Time and Cycles/Instruction 


For a given instruction set architecture, CPU performance is inversely propor- 
tional to the product of cycle_time and cycles/instruction. CLIPPER achieves its 
high level of performance via a careful tradeoff of these two factors, in contrast to 
the “one size fits all” approach that is currently popular in some quarters. The 
design philosophy espoused by MIPS [Henn82] and RISC [Patt85] is that all, or
almost all instructions must execute in one cycle; this implementation approach 
was previously used by Procrustes [Bull55] in matching guests to beds. 


The disadvantage to the single cycle per instruction approach is that not all 
instructions are equally complex, and the cycle time must accommodate the long- 
est single cycle instruction; conversely, partitioning an instruction into a larger 
number of sequential phases provides more possibilities for overlap. For these rea- 
sons, the CLIPPER designers chose to implement the instruction set in the 
manner of a traditional mainframe, whereby the longer and more complex instruc- 
tions are permitted more cycles to complete. The CPU cycle time (30ns) was 
chosen as a design goal, on the basis that the technology available at the time of 
chip fabrication would permit the basic instructions (e.g. add, logical operations) to 
complete in one cycle. Longer instructions were allowed to take as many cycles as 
necessary, and the appropriate hardware support was placed on-chip to ensure that 
they executed correctly in the presence of traps, interrupts, and data and register 
dependencies. 


8.2. Performance Improvement 


There are two approaches to improving the performance of an implementation 
of a given instruction set architecture. The first is technology scaling, by which 
faster technology and denser packaging (or a smaller chip) permit the machine to 
run faster, without any changes in the design architecture, or even in the circuit 
diagram. 

It is important to note that (for the most part) performance improvements in 
scaling from one technology (e.g. 2-micron CMOS) to another (e.g. 1.25-micron 
CMOS) are independent of the actual absolute value of the cycle time. The cycle
time in a machine is limited by the longest signal path (including gate delays) 
within a cycle; halving the longest path permits almost halving the cycle time. 
Scaling of chip technology to obtain a higher performance CLIPPER implementa- 
tion is underway, even though CLIPPER already has an impressively fast cycle 
time. 


In considering the performance of CLIPPER, evidence supports comments in 
[Mate84], where it is noted that the factor most strictly limiting performance on a 
high performance microprocessor is the memory interface. As is discussed in more 
detail in [Cho86, Holl87], CLIPPER is most strictly limited by memory delays,
despite the use of two busses (one each for instructions and data), the fact that those
busses are short, and that each is dedicated to communication between a pair of 
chips. In scaling any processor, the limiting factor will continue to be the memory 
interface, and that does not scale as well as other aspects of the machine. 


The other approach to improved performance is a redesign which decreases 
the number of cycles per instruction. In general, this can be accomplished by the 
use of more logic. For example, a multiplier or adder can always be made faster 
with the addition of more gates; thus the current multiplication time (table 3) 
could be reduced. Similar improvements are possible in other multicycle instruc- 
tions. For comparison, we note that the Amdahl 470V/6 required 5-6 cycles per 
instruction, and that was roughly halved for the 580. The DEC VAX 11/780
needed about 10 cycles per instruction [Emer84] and that was reduced to about 6 
cycles for the 8600 [Foss85]; the cycle time was only reduced from 200ns to 80ns, 
but the total performance was improved by a factor of almost five. 


Projects to improve CLIPPER performance by both technology scaling and the 
addition of further (design architecture) features are underway.


9. Tradeoffs and Extensions


9.1. Instruction Set Choice


9.1.1. Why Not “Pure RISC”? 


As noted above, the current research trend in computer architecture is to 
design machines with extremely simple instruction sets. Despite some advantages 
to such an approach, our feeling is that the instruction set should be made as sim- 
ple as possible, but no simpler. Very simple instruction sets impose burdens on 
the compiler and the assembly language programmer, and increase code volume 
and I-cache traffic and miss ratios. Single cycle instruction execution for (almost 
all) instructions results in less than optimal instruction overlap. By restricting
ourselves to a load / store architecture, but permitting variable length instructions 
with variable length execution time, we believe that we've created a design which 
is both functional and allows efficient implementation. 


In particular, the CLIPPER microprocessor architecture was designed to 
include string instructions (implemented in the on-chip ROM), on-chip floating 
point, hardware support for the TLB, hardwired pipeline interlocks, and interrupt 
and trap sequences (in the MIROM). We believe that given the current and likely 
future state of the art, these represent a good tradeoff. 


9.1.2. Why Not “More CISC”, and What We Chose Not To Include 


There is a certain intellectual appeal to taking commonly needed software 
functions and implementing them in single instructions. Extreme examples are 
instructions to manipulate queues and compute polynomials, but we can include 
such reasonable operations as the three memory address instruction in this class. 
There are several problems with this approach. First, we note that the number of 
gates available on a chip in current technology is not sufficient to implement these 
instructions entirely in hardware; microcode would have been required. Existing 
microcoded machines tend to be slow. Other issues are discussed below. 


There are a number of instructions and features that were deliberately omit- 
ted from the CLIPPER ISA, and we comment specifically on some of them here. 


A natural form of computation is memory to register, register to memory, or 
memory to memory, but such instructions are not provided. There are three rea- 
sons for this: (a) It is very simple to generate the corresponding code sequences. 
Very few extra instruction bytes are needed, since the total number of operand 
specifiers is the same. (b) There is usually little savings in execution time, since 
the same sequence of operations must occur. (c) There is considerable additional 
complexity, because of the problems of memory traps and interrupts, especially 
page faults. In particular, if there are multiple memory references per instruction, 
then there can be multiple page faults; an extreme case of this problem occurs 
with the M68000 which permits an indirect indexed address mode. 


Some complicated instructions seen in the IBM 370 and DEC VAX (e.g.
translate, translate and test, edit, queue, polynomial, etc.) were omitted due to 
their substantial complexity, and the fact that the same functionality can be rea- 
sonably implemented in software. In practice, a compiler is seldom able to gen- 
erate these instructions even when they are needed. All existing studies show 
that a small number of opcodes account for the large majority of all instructions 
executed; see e.g. [Peut77, Clar82]. For many of the same reasons, we omitted 
complicated branch instructions (such as decrement, test, and branch if less than 
zero). 


Protection domains were limited to those possible from the protection bits
assigned to page frames (see [Cho86] for further discussion), since very few operat- 
ing systems are prepared to take advantage of ring-structured protection domains 
or similarly complex designs. Likewise, a segmented address space was avoided, 
due to the inflexibility it imposes on the use of memory, including the impedi- 
ments it provides to increases in the address space size, and the fact that the same 
functionality is obtained by protection bits on pages. General purpose registers 
were selected over dedicated registers (e.g. index, data and address registers) for 
programming flexibility and generality. 

There is no need for a compatibility mode in CLIPPER, since it is not an 
upward compatible extension of an existing architecture. Not having to provide 
this feature greatly simplified the design, avoided undesirable architectural 
compromises, and permitted increased performance. 


Extended precision arithmetic was not considered to be sufficiently useful at 
the time CLIPPER was designed to justify the difficulty of implementing it on a 
chip so tightly constrained with regard to area. Extended precision can be 
obtained currently with instruction sequences, and opcodes are available to implement
extended precision in the hardwired instruction set at some future time.


9.1.3. Possible Additions 


One of the limiting factors in the design of a microprocessor is the silicon area 
available and the area required for each gate. For that reason, some features origi- 
nally considered were deferred until future CLIPPER versions, when technology 
advances sufficiently. For example, a delayed branch has the advantage of reduc- 
ing the pipeline penalty due to successful branches. The problem with a delayed 
branch is that of saving the state, when a trap or interrupt occurs between the 
time the branch is selected (the delayed branch instruction) and the time that it 
takes effect (one or two instructions later). The existing CLIPPER chip simply 
doesn’t have the space on it for the necessary logic to implement this correctly. In 
addition to the delayed branch, a delayed load and vector instructions are under 
consideration. 


9.2. Pipeline Control 


CLIPPER is pipelined, and the pipeline is fully hardware controlled, with all 
interlocks (including checks for register dependencies) enforced with hardwired 
logic. This is in contrast to designs such as RISC [Patt85] and MIPS [Chow86],
where the compiler must reorganize code and insert noops as necessary. We chose 
to use hardware control deliberately, as we believe that: (a) it is an unreasonable 
burden to require that the compiler understand the pipeline and insert noops as 
necessary; (b) it is an unreasonable burden on the assembly language programmer 
and/or code generator to require that he overcome the lack of hardware; (c) the
implications of (a) and (b) are that without interlocks, code will tend to be “buggy”;
(d) compilers and programs become implementation dependent; instead of just 
depending on the instruction set architecture, they depend on the precise features 
of the pipeline. Object code is thus not portable between different implementa- 
tions of the same instruction set architecture. 
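
Point (d) can be illustrated with a sketch of what a code generator must do for a pipeline that lacks hardware interlocks. The insert_noops function and its result_latency parameter are hypothetical; the latency stands for an implementation-specific number of instruction slots that must separate a producer from a consumer of its result.

```python
def insert_noops(program, result_latency):
    """Pad a program for a pipeline WITHOUT hardware interlocks.

    program: list of (dest, sources) register tuples, one per instruction.
    result_latency: instruction slots that must separate a producer from a
    consumer of its result on a particular (hypothetical) implementation.
    """
    out = []
    ready_at = {}  # register -> index in `out` at which its value is usable
    for dest, sources in program:
        earliest = max((ready_at.get(r, 0) for r in sources), default=0)
        while len(out) < earliest:
            out.append("noop")          # software stands in for the interlock
        out.append((dest, sources))
        ready_at[dest] = len(out) + result_latency
    return out


program = [("r1", ("r2", "r3")), ("r4", ("r1", "r5"))]
code_for_impl_a = insert_noops(program, result_latency=0)
code_for_impl_b = insert_noops(program, result_latency=2)
```

The same source program yields different object code for the two latencies, so code scheduled for one implementation is either incorrect or needlessly padded on the other. With hardwired interlocks the unpadded program runs correctly on both.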


9.3. On-Chip Cache or Larger Instruction Buffer 


Considerable study was devoted to the question of whether CLIPPER should 
have an on-chip cache or a significantly larger instruction buffer than the current 
8 bytes. We do not have space here to discuss the reasons for the existing choice
in detail (see [Cho86]) but we briefly note some of the points. The basic problem is 
that given the limited chip area, we were unable to put enough cache or buffer on 
the chip to yield a useful performance improvement. In addition, there is the 
problem of virtual vs. real addressing, synonyms, cache flushing, and cache con- 
sistency [Smit82]. For a future design, a 2- or 3-level cache (on chip, CAMMU 
chip, cache board) is a possibility. 


9.4. Address Space Size 


An issue which has been emphasized throughout this paper is that of address 
space extensibility. Almost any shortcoming in a computer architecture can be 
overcome except too small an address space; this is the reason that DEC was 
finally forced to design the VAX (“virtual address extension”) as a replacement for 
the PDP-11. CLIPPER provides a flat, uniform (not partitioned) 32-bit address 
space. Because of the availability of additional address modes, it will be possible 
to define modes which produce more than 32 bits of virtual address. More than 32 
bits of physical addressing can be obtained by changing the format of the page 
tables. These changes are straightforward and would require few if any user pro- 
grams to undergo conversion. We expect that within 10 or 15 years, both physical 
and virtual addresses will need more than 32 bits. 


9.5. Better Multiprocessor Cache Consistency 


As explained in [Cho86], the CLIPPER CAMMU implements a bus watch 
cache consistency protocol; it watches memory transactions on the bus, and main- 
tains cache consistency in a system with multiple CPUs and shared writeable 
areas of memory. The algorithm implemented requires that shared writeable data 
be marked, and thus the CAMMU only need take action when the reference is 
marked shared. Because consistency operations involve significant performance 
costs, the use of this mode should be minimized. With improved technology, we 
expect that it will be possible to implement a much more sophisticated bus 
interface, with a dual ported cache directory, and an optimized consistency algo-
rithm such as is described in [Swea86]. 


10. Conclusions and Overview 


In this paper, we’ve discussed the instruction set architecture and the imple- 
mentation of the Fairchild CLIPPER microprocessor. The machine was designed 
from scratch to provide high performance, convenient programmability and the 
ability to extend the architecture as technology improves and the art of computer 
architecture design advances. Our discussion has included both functional descrip- 
tion (concentrating on those functions that are interesting and/or unusual) and a 
significant consideration of design tradeoffs and choices. We believe that 
CLIPPER not only represents a good set of choices, but that this paper is impor- 
tant in discussing and documenting those tradeoffs. 